Multiphase comparative study for WHO/ISUP nuclear grading diagnostic model based on enhanced CT images of clear cell renal cell carcinoma

This study aimed to compare the diagnostic value of different enhancement phases in distinguishing low from high nuclear grade clear cell renal cell carcinoma (ccRCC) by building machine learning classifiers on enhanced computed tomography (CT) images. A total of 51 patients (Dataset1, including 41 low-grade and 10 high-grade) and 27 patients (independent Dataset2, including 16 low-grade and 11 high-grade) with pathologically proven ccRCC were enrolled in this retrospective study. Radiomic features were extracted from the corticomedullary phase (CMP), nephrographic phase (NP), and excretory phase (EP) CT images and selected using the recursive feature elimination with cross-validation (RFECV) algorithm; group differences were assessed using the t-test and Mann–Whitney U test for continuous variables. Support vector machine (SVM), random forest (RF), XGBoost (XGB), VGG11, ResNet18, and GoogLeNet classifiers were established to distinguish low-grade from high-grade ccRCC. The classifiers based on NP CT images (Dataset1, RF: AUC = 0.82 ± 0.05, ResNet18: AUC = 0.81 ± 0.02; Dataset2, XGB: AUC = 0.95 ± 0.02, ResNet18: AUC = 0.87 ± 0.07) achieved the best performance and robustness in distinguishing low-grade from high-grade ccRCC, while the EP-based classifiers performed worse. NP CT images therefore had the best performance in diagnosing low and high nuclear grade ccRCC. The firstorder_Kurtosis and firstorder_90Percentile features play a vital role in the classification task.


Machine learning
Renal cell carcinoma (RCC) accounts for more than 90% of renal cancers 1, of which clear cell renal cell carcinoma (ccRCC) is the most common pathological type, with a poor prognosis, accounting for 70% of renal malignant tumors 2. The standard-of-care treatment for ccRCC is closely related to the pathological grade 3,4, and metastatic disease relies on systemic treatments. Surgery is still considered the standard and curative option for localized ccRCC. However, approximately 30% of patients initially diagnosed with localized ccRCC have high-risk features, and their risk of recurrence is above 40% 5. For these patients, intensive management has been reported to improve overall survival, and pre-operative neoadjuvant treatments with targeted therapies or immunotherapies have been investigated to decrease the risk of recurrence in high-risk localized ccRCC. Therefore, it is vital to identify patients with high-risk features at an early stage and select the optimal strategy. Several models using different factors have been proposed to define the high-risk group 6. Among these predictive factors, pathological nuclear grade is extensively studied and considered to be one of the most important prognostic factors for ccRCC, and patients with high nuclear grade are at greater risk of postoperative recurrence and poor prognosis. At present, the WHO/ISUP pathological nuclear grading

Materials and methods
The overall flow chart of the study is shown in Fig. 1. The details of each step are introduced in the following sections.

Study patients
This study was approved by the Institutional Ethics Committee of Qilu Hospital (Num: KYLL-202205-035-1). Considering that enhanced CT examination is a routine noninvasive technique for patients with suspected ccRCC, the requirement for informed consent was waived. Patients with preoperative enhanced CT scanning and pathologically proven ccRCC between January 2017 and December 2019 were included in the study. The selection criteria were as follows: (a) patients with ccRCC diagnosed by pathology; (b) patients with enhanced CT scans in three phases before surgery; (c) complete images and clinical data. The patient recruitment flowchart is shown in Fig. 2. In the first four days of data set 1, 76 cases were collected, and 10 cases with incomplete clinical data and 15 cases with incomplete three-phase enhanced CT data were excluded. Finally, we included 51 cases for training. In addition, we used the same method to collect 27 cases after 2019 as an independent verification set.

CT technique
We employed a SOMATOM Force CT scanner from Siemens Healthcare, located in Erlangen, Germany. Standard protocols were strictly adhered to for all patients. In brief, patients received 60-80 ml of the non-ionic contrast agent iopromide, administered intravenously into the forearm at a rate of 3-5 ml/s using a high-pressure syringe. The scanning protocol included the following phases:

Corticomedullary Phase (CMP): The scanning delay time for this phase was set at 20-25 s. During this phase, contrast enhancement within the renal cortex and medulla is maximized, providing detailed imaging of the renal vasculature.

Nephrographic Phase (NP): The scanning delay time for this phase was set at 50-60 s. This phase captures the peak enhancement of the renal parenchyma, allowing for visualization of renal structures and pathology.

Excretory Phase (EP): Excretory phase scanning was performed 3-5 min after contrast injection. During this phase, contrast material is excreted into the renal collecting system and urinary tract, facilitating the visualization of urinary structures and the evaluation of any excretory abnormalities.

Histopathological assessment of nuclear grade
The WHO/ISUP nuclear grades were collected from the histopathological reports at Qilu Hospital of Shandong University. Histopathological evaluations were performed by two genitourinary pathologists with over 10 years of experience in the WHO/ISUP grading system. These grades are determined according to the cytological and nuclear morphological characteristics of renal cell carcinoma. Usually, after histological examination and microscopic observation, the renal cancer tissue samples are judged according to cell atypia (cell size, shape, and nuclear characteristics) and the structural characteristics of the tumor (such as the arrangement of tubular structures). The morphological characteristics of different parts of the tumor are determined in this way, and the highest grade observed is taken as the label grade. Because some deviations may remain, we grouped grades 1 and 2 as low grade and grades 3 and 4 as high grade to form the experimental data. Since our research focuses only on the best phase for distinguishing high-grade from low-grade nuclei, this binary setup is sufficient to answer the question, and the whole experiment was designed accordingly.

Resampling
In this experiment, because different patients may be scanned by different devices, the metadata of some CT images may differ, which may lead to incorrect feature extraction at a later stage. To avoid this problem, we first inspected the metadata of the CT images and found that the pixel spacing differed slightly between images. To unify the differences in layer thickness and in-plane resolution, the B-spline interpolation algorithm is used to resample all image data to an isotropic voxel size of 1 × 1 × 1 cm. In the interpolation formula, W is the weight. In the expression for W, the value of a ranges from −1 to 0 and is generally fixed at −0.5; p(i, j)·x and p(i, j)·y represent the x coordinate and y coordinate of the point p(i, j), respectively; and d is an intermediate substitution value. The weight W is then used to update the initial pixel value, which realizes the cubic B-spline interpolation algorithm.
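The weight expression itself is not recoverable from the text above. As a hedged reconstruction, the description (a parameter a between −1 and 0, usually −0.5, applied to a coordinate distance d) matches the widely used cubic convolution kernel, so the block below states that standard kernel and the resulting weighted sum over a 4 × 4 neighbourhood. The symbols f and \hat{f} are introduced here only for illustration, and this is an assumption rather than a confirmed copy of the formula used in the study.

```latex
W(d) =
\begin{cases}
(a+2)\,|d|^{3} - (a+3)\,|d|^{2} + 1, & |d| \le 1 \\
a\,|d|^{3} - 5a\,|d|^{2} + 8a\,|d| - 4a, & 1 < |d| < 2 \\
0, & \text{otherwise}
\end{cases}
\qquad
\hat{f}(x, y) = \sum_{i=0}^{3}\sum_{j=0}^{3}
  f\bigl(p(i,j)\bigr)\,
  W\bigl(x - p(i,j)\cdot x\bigr)\,
  W\bigl(y - p(i,j)\cdot y\bigr)
```

With this kernel, W(0) = 1 and W(1) = W(2) = 0 for any a, so the interpolated image reproduces the original values at the original grid points.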

Window width and level adjustment
Different human tissues or lesions have different CT values. To observe a certain range of tissues, the appropriate window width and window level must be chosen; modifying the window width and window level is usually referred to as adjusting contrast and brightness. An improper window width and window level will blur the image and interfere with feature extraction and with the annotation of the kidney, so it is necessary to adjust the window width and window level of the CT images. In this experiment, based on experience and on the gray values of the image in the region of interest, we set the window level to 0 HU and the window width to 400 HU using the SetWindow and SetOutput methods of the IntensityWindowingImageFilter in SimpleITK.
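A minimal pure-Python sketch of what such a windowing step does, assuming the usual behaviour of clipping to the window and then mapping it linearly to an output range. The function name `apply_window` and the 0 to 255 output range are illustrative assumptions; the paper does not state the output range, and this is a stand-in for the SimpleITK filter rather than a call to it.

```python
def apply_window(hu_values, level=0.0, width=400.0, out_min=0.0, out_max=255.0):
    """Clip CT values to [level - width/2, level + width/2] and rescale linearly.

    level=0 and width=400 follow the text; out_min/out_max are assumed values.
    """
    lo = level - width / 2.0   # lower window bound, -200 HU for the defaults
    hi = level + width / 2.0   # upper window bound, +200 HU for the defaults
    result = []
    for v in hu_values:
        clipped = min(max(v, lo), hi)                # values outside the window saturate
        scaled = (clipped - lo) / (hi - lo)          # position inside the window, 0..1
        result.append(out_min + scaled * (out_max - out_min))
    return result
```

For example, `apply_window([-1000, 0, 1000])` maps air and dense bone to the two ends of the output range and places 0 HU in the middle.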

Artifact correction and image denoising
Images with poor imaging quality in particular require unified artifact correction and image denoising so that the extracted features do not contain excessive noise. In this experiment, the artifact region is detected from the image histogram, using the differences in the distribution of pixel values. In addition, the voxel values are discretized to reduce image noise and improve the computational efficiency of feature extraction.

Pixel value standardization
In addition, to speed up the convergence of model training, the image pixel values were standardized to the range 0 to 1. The formula is as follows:

$$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x_i$ represents the value of an image pixel, and $\min(x)$ and $\max(x)$ represent the minimum and maximum values of the image pixels, respectively.
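The min-max scaling described above can be sketched in a few lines. The function name `min_max_normalize` and the handling of a constant image (all values mapped to 0) are assumptions made for this illustration.

```python
def min_max_normalize(values):
    """Scale a sequence of pixel (or feature) values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant input carries no contrast, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The same helper applies unchanged to a column of extracted features, which is how the feature normalization step is described.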

Feature normalization
In addition, for the radiomics model, the extracted image features are also a set of numerical data, similar to image pixels. Therefore, to unify the feature scales and improve the training efficiency of the model, the features were normalized. The processing method is the same as the pixel value standardization method.

Tumor segmentation and feature extraction
All CT images are exported from the Picture Archiving and Communication System (PACS) and then transferred to a separate workstation for manual segmentation using ITK-SNAP.

The synthetic minority over-sampling technique (SMOTE) is one of the popular methods to improve the performance of classification models on imbalanced data 23. First, each sample $X_i$ is selected in turn from the minority class as the root sample for synthesizing new samples. Second, according to the up-sampling magnification $n$, a sample is randomly selected from the $k$ nearest neighbors of $X_i$ in the same class ($k$ is generally an odd number, such as $k = 5$) as an auxiliary sample, and this is repeated $n$ times. Linear interpolation is then performed between $X_i$ and each auxiliary sample by the following formula, so that $n$ synthetic samples are generated:

$$x_{new,attr} = x_{i,attr} + (x_{ij,attr} - x_{i,attr}) \times \gamma \qquad (7)$$

where $x_i \in R^d$; $x_{i,attr}$ is the attr-th attribute value of the $i$-th sample in the minority class, $attr = 1, 2, \ldots, d$; $\gamma$ is a random number between 0 and 1; $X_{ij}$ is the $j$-th neighbor sample of sample $X_i$, $j = 1, 2, \ldots, k$; and $x_{new}$ represents a new sample synthesized between $X_{ij}$ and $X_i$. The feature dimension of our sample is 2, so each sample can be represented by a point on the 2-dimensional plane. A new sample $x_{new,attr}$ synthesized by the SMOTE algorithm is equivalent to a point on the line connecting the point representing $x_{i,attr}$ and the point representing $x_{ij,attr}$. We implemented SMOTE using Python's sklearn library. Since our dataset is small, the choice of random seed may cause large differences in the oversampling and training results. To address this instability, we chose 30 random seeds between 0 and 300 and averaged the results at the end.
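The interpolation step can be illustrated with a small standard-library sketch. This is not the library implementation used in the study; `smote_samples` and its parameters are hypothetical names, neighbors are found by plain Euclidean distance, and at least two minority samples are assumed.

```python
import random


def smote_samples(minority, k=5, n=1, seed=0):
    """Generate n synthetic samples per minority sample by linear interpolation.

    minority: list of feature vectors (lists of floats), at least two of them.
    k: number of nearest same-class neighbours to draw the auxiliary sample from.
    seed: random seed; the study reports averaging results over 30 different seeds.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x_i in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance to x_i.
        others = [x for j, x in enumerate(minority) if j != i]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(x, x_i)))
        neighbours = others[:k]
        for _ in range(n):
            x_ij = rng.choice(neighbours)   # auxiliary sample
            gamma = rng.random()            # random number in [0, 1)
            # x_new = x_i + (x_ij - x_i) * gamma, applied attribute by attribute
            synthetic.append([a + (b - a) * gamma for a, b in zip(x_i, x_ij)])
    return synthetic
```

Every synthetic point lies on the segment between a root sample and one of its neighbours, which is why changing the seed changes the oversampled set and motivates averaging over several seeds.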

Feature selection and model training
We use the wrapper-based RFECV feature selection method. Feature selection with RFECV is divided into two parts: the first is the RFE part, that is, recursive feature elimination, which is used to rank the importance of features; the second is the CV part, that is, cross-validation. After feature ranking, the best number of features is selected through cross-validation. The specific process we use in this experiment is as follows.

RFE stage
(1) The initial feature set includes all available features.
(2) Use the current feature set to build a model, and then calculate the importance of each feature.
(3) Delete the least important feature(s) and update the feature set.
(4) Return to step 2 until the importance ranking of all features is completed.

CV stage
(1) According to the importance of features determined in the RFE stage, different numbers of features are selected in turn.
(2) Cross-validate the selected feature set.
(3) Determine the number of features with the highest average score and complete feature selection.

We built SVM 24, RF 25, and XGB 26 classifiers, trained the models using five-fold cross-validation, and performed grid-search parameter optimization within each training fold, with SMOTE oversampling. The feature weight coefficients of the linear SVM and the feature Gini coefficients of RF and XGB are used as the basis for RFECV to filter the feature set. The models are trained 30 times with 30 oversampling random seeds, and the averaged results are used to evaluate model performance.
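The two stages can be sketched end to end with the standard library. This is a toy stand-in under loudly stated assumptions: feature importance is taken as the absolute gap between class means (the study uses SVM weights and tree-based importances), the scored model is a nearest-centroid classifier, labels are 0/1 with both classes present, and all function names are hypothetical.

```python
from statistics import mean


def importance(X, y):
    """Assumed importance score: absolute gap between the two class means."""
    scores = []
    for f in range(len(X[0])):
        pos = [row[f] for row, label in zip(X, y) if label == 1]
        neg = [row[f] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def rfe_ranking(X, y):
    """RFE stage: repeatedly drop the least important feature.

    Returns feature indices ordered from most to least important.
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        sub = [[row[f] for f in remaining] for row in X]
        scores = importance(sub, y)
        worst = remaining[scores.index(min(scores))]
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]


def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a nearest-centroid classifier (stand-in for SVM/RF/XGB)."""
    cents = {}
    for c in set(y_tr):
        rows = [r for r, label in zip(X_tr, y_tr) if label == c]
        cents[c] = [mean(col) for col in zip(*rows)]
    hits = 0
    for r, label in zip(X_te, y_te):
        pred = min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(r, cents[c])))
        hits += pred == label
    return hits / len(y_te)


def cv_score(X, y, feats, k=5):
    """Mean k-fold accuracy using only the given feature indices."""
    sub = [[row[f] for f in feats] for row in X]
    idx = list(range(len(y)))
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        scores.append(centroid_accuracy([sub[i] for i in train], [y[i] for i in train],
                                        [sub[i] for i in fold], [y[i] for i in fold]))
    return mean(scores)


def rfecv(X, y, k=5):
    """CV stage: keep the top-n ranked features with the best mean CV score."""
    ranking = rfe_ranking(X, y)
    best_n = max(range(1, len(ranking) + 1), key=lambda n: cv_score(X, y, ranking[:n], k))
    return ranking[:best_n]
```

When several feature counts tie on the cross-validated score, this sketch keeps the smallest count, which favours compact feature subsets.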

Support vector machine
Given a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in R^m$ represents the m-dimensional feature data, $y_i \in \{-1, +1\}$ represents the classification label, and $n$ is the total number of samples, the optimal hyperplane can be expressed as:

$$W^{T}x + b = 0$$

where the column vector $W = (W_1, W_2, \ldots, W_m)^T$ is the normal vector, which determines the direction of the hyperplane, and $b$ is the displacement term, which together with the normal vector $W$ determines the distance between the hyperplane and the origin. In order to maximize the classification interval, the objective function is:

$$\min_{W, b, \xi} \ \frac{1}{2}\lVert W \rVert^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\,(W^{T}x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$

where $\xi_i$ are the slack variables and $C > 0$ represents the penalty factor; a larger $C$ means a greater penalty for misclassification.
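To make the role of the penalty factor C concrete, the sketch below evaluates a linear decision value and the soft-margin objective in its equivalent hinge-loss form. The function names are illustrative assumptions, and the code only evaluates the objective for a given W and b; it does not train an SVM.

```python
def decision_value(w, b, x):
    """Signed score W^T x + b; its sign gives the predicted class (-1 or +1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||W||^2 + C * sum of hinge losses max(0, 1 - y_i (W^T x_i + b))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y))
    return margin_term + C * hinge
```

Samples that lie beyond the margin on the correct side contribute zero loss, so only margin violations are scaled by C.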

Random forest
When classifying, each RF base classifier $h_i$ predicts a label from the label set $\{c_1, c_2, \ldots, c_{n-1}, c_n\}$, and the predicted output of $h_i$ on the sample $x$ is expressed as an N-dimensional vector $(h_i^1(x); h_i^2(x); \ldots; h_i^N(x))$, where $h_i^j(x)$ is the output of $h_i$ on the category label $c_j$. The final RF prediction is obtained by voting over the trees. In the voting formula, $n_{tree}$ is the number of trees in the RF classifier; $I(*)$ is an indicator function; $(n_{hi}, c)$ is the classification output of the tree for class $c_j$; and $n_{hi}$ is the number of leaf nodes of the tree. Therefore, RF is an ensemble learning model with strong generalization ability.
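As a simplified illustration of the voting idea, the sketch below uses plain majority voting over per-tree labels. This is an assumption: the exact voting formula of the paper is not recoverable from the text, and the tie-breaking rule (smallest label wins) is chosen here only to keep the result deterministic.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    top = max(counts.values())
    # Assumed tie-break: among equally voted labels, return the smallest one.
    return min(label for label, c in counts.items() if c == top)
```

For instance, three trees voting `["low", "high", "low"]` yield `"low"`.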

XGBoost
Assume a data set containing independent variables $X = \{x_1, x_2, \ldots, x_m\}$, classification variables $y_i$, and $n$ samples. $K$ CART trees are obtained by training on it, and the final predicted value $\hat{y}_i$ is the accumulation of the values of these tree models:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

In the formula, $f(x)$ represents a CART regression tree, $w_k$ is the leaf weight of the $k$-th regression tree, and $q(x)$ is the index of the leaf node; that is, the sample $x$ finally falls on a certain leaf node of the tree, and its value is $w_{q(x)}$. XGB adds a regular term $\Omega(f)$ to the traditional loss function and uses incremental learning to optimize the function: starting from a constant, each round of training adds a new function while keeping the model of the previous round unchanged. $\gamma$ and $\lambda$ are the parameters of the regular term, and $T$ represents the number of leaf nodes. XGB can improve the classification accuracy by combining many weak classifiers into a powerful combined classifier.
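The expressions for the incremental update and the regular term are not recoverable from the text above. As a hedged reconstruction, the standard XGBoost formulation is given below; it uses the same symbols γ, λ, T and leaf weights w_j, while the loss l and the round index t are introduced here for illustration. This is assumed to be what the description refers to rather than a confirmed copy of the original equations.

```latex
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i),
\qquad
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t)}\bigr) + \sum_{k=1}^{t} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```

The term γT penalizes the number of leaves and the λ term penalizes large leaf weights, so both discourage overly complex trees.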

Statistical analysis
All statistical analyses were performed using Python (version 3.6.5). The group differences were assessed using a Mann–Whitney U test for continuous variables. The Mann–Whitney U test can be regarded as a substitute for the independent two-sample t-test, or the corresponding large-sample normal test, for testing the difference between two means. Because the Mann–Whitney U test explicitly considers the rank of each measured value in each sample, it uses more information than a sign test. The null hypothesis H0 is that there is no significant difference between the distributions of the two populations from which the two independent samples are drawn. If the p value is less than or equal to a given significance level α, the null hypothesis is rejected and the distributions of the two populations are considered to be significantly different. Otherwise, the null hypothesis is not rejected, and no significant difference is found between the distributions of the two populations 27,28. In this experiment, a p value less than 0.05 indicates a statistically significant difference.
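The rank-based nature of the test can be shown with a short sketch that computes only the U statistics, using average ranks for ties. The function name is an assumption, and the p value is deliberately not computed here; in practice a statistics library would normally be used for the full test.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples, using average ranks for ties."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0   # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a
    return u_a, u_b
```

A U value of 0 for one group means every value in that group is smaller than every value in the other, which is the strongest possible separation.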
In the classification task, the predictions are divided into true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Here, TP denotes the number of cases correctly predicted as positive, FP denotes the number of cases wrongly predicted as positive, TN denotes the number of cases correctly predicted as negative, and FN denotes the number of cases wrongly predicted as negative. These four cases can be arranged in a confusion matrix to evaluate the classification performance of the model. Based on the confusion matrix, accuracy (ACC) is the proportion of all correct predictions in the total sample. The higher the accuracy, the more accurate the model prediction and the fewer samples are wrongly predicted. Specificity indicates the proportion of correctly predicted negative samples among the actual negative samples; the higher the specificity, the lower the false detection rate.
Sensitivity, also known as the true positive rate or recall, indicates the proportion of correctly predicted positive samples among the actual positive samples; the higher the sensitivity, the lower the missed detection rate.
The F1 value (F1-measure) combines precision and sensitivity into a single index. The closer the index is to 1, the more satisfactory the output of the model. The receiver operating characteristic (ROC) curve, based on sensitivity and specificity, plots TPR on the y-axis and 1 − TNR on the x-axis. The area under the ROC curve (AUC) is used to evaluate the classification ability of the model, and a higher AUC value signifies better model performance. Finally, by analyzing these results, we can identify the best phase for diagnosing nuclear grade.
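The evaluation indicators above follow directly from the four confusion-matrix counts, and the AUC can be computed as the probability that a positive case is scored above a negative one. The sketch below uses these standard definitions; the function names are illustrative, and non-zero denominators (at least one positive, one negative, and one positive prediction) are assumed.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # true positive rate, recall
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (positive) or 0 (negative); tied scores count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of the AUC is equivalent to the area under the ROC curve and is convenient for small samples such as the ones in this study.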

Ethical approval
The study involving human subjects was reviewed and approved by the Ethics Committee of the Qilu Hospital (No: KYLL-202205-035-1).

Consent statement
All methods were performed in accordance with the relevant guidelines and regulations. All participants gave their written informed consent.

General characteristics
The training set for this study included 41 patients in the low-grade group and 10 patients in the high-grade group, and the baseline characteristics are shown in Table 1. It contains eight clinical features: Age (years), Gender, Weight (kg), Height (cm), BMI, Tumor Side (left and right), Tumor Location (upper pole, middle, and lower pole), and Tumor Size (cm). The differences between the two groups were analyzed, and the results showed no significant difference in age (p = 0.5141), gender (p = 0.3717), weight (p = 0.7291), height (p = 0.5678), or BMI between the two groups. Moreover, the differences in tumor side (p = 0.5553) and tumor location (p = 0.4593) were not significant, but patients in the high-grade group had significantly larger tumors than those in the low-grade group (p = 0.043 < 0.05). This shows that tumor size differs significantly between the high-grade and low-grade groups in the clinic.

Feature extraction and selection
Texture features are extracted from the region of interest (ROI) of the multi-phase CT images; the results of manual tumor segmentation in the different phases are shown in Fig. 3. In total, we extracted 100 radiomics features from each phase. In each phase, three classifiers, SVM, RF, and XGBoost, are used for feature selection in combination with RFECV. The model is trained 30 times according to the 30 random seeds used for SMOTE oversampling, and a feature subset is obtained from RFECV each time. The number of occurrences of each feature in the 30 feature subsets is counted, and the features that appear no fewer than 15 times are chosen for analysis. In dataset1 and dataset2, the features selected using the three classifiers for the CMP, EP, and NP phases, respectively, are shown in Supplementary Tables 2-7. In dataset1, 19, 4, and 3 features were selected by the SVM, RF, and XGB classifiers in NP, respectively. Moreover, 4, 15, and 3 features were selected in CMP, respectively, and 8, 8, and 3 features were chosen in EP. This will further verify the robustness and reproducibility of the selected features.
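The frequency-based filtering described above (keep the features that appear in at least 15 of the 30 RFECV subsets) can be sketched as follows. The function name `stable_features` and the example feature names are illustrative assumptions.

```python
from collections import Counter


def stable_features(subsets, min_count=15):
    """Return the features selected in at least min_count of the given subsets."""
    # set(subset) ensures a feature is counted at most once per run.
    counts = Counter(feature for subset in subsets for feature in set(subset))
    return sorted(f for f, c in counts.items() if c >= min_count)
```

With 30 runs in which `glcm_Idmn` is always selected, `shape_Sphericity` 20 times, and a third feature 10 times, only the first two pass the threshold of 15.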

Model training and results analysis
In this study, we constructed three classifiers, SVM, Random Forest, and XGBoost, for each of the three stages CMP, NP, and EP, using the feature subset selected by RFECV as model input and five-fold cross-validation for model validation and evaluation. Based on the average of 30 runs with 30 random seeds, the ROC curves of the models for predicting WHO/ISUP nuclear grade from the CT image features of the three stages are shown in Fig. 4. In dataset1 and dataset2, the average accuracy, sensitivity, specificity, AUC, and F1 score of the three classifiers are shown in Table 2. We find that NP has better robustness in discriminating nuclear grades, while all three classifiers exhibit good performance (Dataset1, SVM: AUC = 0.82 ± 0.05, RF: AUC = 0.82 ± 0.05, XGB: AUC = 0.81 ± 0.04; Dataset2, SVM: AUC = 0.84 ± 0.14, RF: AUC = 0.91 ± 0.05, XGB: AUC = 0.95 ± 0.02).

For each stage, we select 10 features that appear more than twice in the feature subsets of dataset1 selected by the three classifiers, normalize them, and draw a heat map of the feature distribution in the high and low nuclear grade groups; the results are shown in Fig. 5. At each stage, three features, glcm_Idmn, gldm_DependenceVariance, and shape_Sphericity, all appear frequently, which demonstrates the importance of these three features. Compared with these features, firstorder_Skewness, glszm_LargeAreaLowGrayLevelEmphasis, and other features rank relatively low in importance. The Mann–Whitney U test was performed on the 10 features of the three stages, and the pairwise differences between the three stages in the high-grade and low-grade groups were drawn as p-value histograms, as shown in Fig. 6. A p-value less than 0.05 was considered to indicate a difference between stages. It is worth noting that although the feature glcm_Idmn has a high frequency in the other stages, it has a very high frequency in the EP stage, which is clearly different from the other stages. The gldm_DependenceVariance feature is not prominent in the other stages, but it has a relatively high frequency in the NP stage. Other features with high frequency and significant differences between CMP and the other stages are mostly shape and texture features, which is consistent with the previous clinical characteristics analysis.

Specifically, we found that 4/5 of the selected important features were texture features, the exceptions being one first-order feature and one shape feature, while the overall discrepancy of features among the three stages was more significant in the low nuclear grade group than in the high nuclear grade group (mean p-value less than 0.08), indicating that CT image texture in low-grade tumors differed more across the three enhancement stages. In addition, the first-order feature Skewness ranked high in the assessment of feature importance during NP and showed a significant difference compared with the other two stages (p < 0.05), which indicated that the CT image intensity fluctuation of tumors is largest during NP and that the lesion area is easier to detect in this stage. The value of the feature glcm_Idmn, which mainly measures the local uniformity of the image, is significantly larger in EP than in CMP (p < 0.01) and NP (p < 0.01). This indicates that the contrast enhancement of ccRCC CT images has largely disappeared in EP, resulting in poor performance in distinguishing nuclear grades. Additionally, altered cellular permeability, abnormal angiogenesis, and necrosis of malignant tumors cause structural changes in the tissue, resulting in tumor heterogeneity in NP and CMP. However, only one shape feature appeared among the selected important features, and it did not show a significant difference across the three stages in either the high-grade or the low-grade group, so we speculate that the value of tumor shape features in distinguishing nuclear grades is approximately the same in the three stages.
For the optimal XGBoost model trained on dataset2, we visualize the feature decisions for NP through SHAP values, as shown in Fig. 7. SHAP values are a method used to assess feature importance in machine learning models. They quantify the impact of each feature on model predictions. By understanding SHAP values, we can gain insight into the contribution of individual features and improve model interpretability. According to the SHAP values, the features firstorder_Kurtosis and firstorder_90Percentile have the most significant impact on the model's predictions, followed by glcm_ClusterShade, firstorder_10Percentile, firstorder_Skewness, and shape_Elongation. The first-order features Kurtosis, 90Percentile, and 10Percentile and the shape feature Elongation show an inverse relationship with the prediction (i.e., when their values are high, they have a negative effect on the model's predictions, indicating a tendency to predict low-grade ccRCC; conversely, when their values are low, the model tends to predict high-grade ccRCC). This suggests that first-order features represent the grayscale distribution, with low-grade ccRCC exhibiting a more uniform grayscale value distribution and high-grade ccRCC showing the opposite pattern. The texture feature ClusterShade and the first-order feature Skewness are positively related to the prediction, indicating that CT images of higher-grade tumors exhibit a higher degree of pixel value aggregation within local regions, leading to more heterogeneous texture features in the image. This implies the presence of a wide range of pixel value aggregations in the image, possibly reflecting irregular or heterogeneous structures within the tissue, corresponding to complex texture features within the tumor region.

Deep learning method
To further study the multi-phase comparison of WHO/ISUP nuclear grading diagnosis models based on enhanced CT images of ccRCC, we also designed end-to-end deep learning classification models for nuclear grading, in which three end-to-end deep learning models, VGG11, ResNet18, and GoogLeNet, were constructed for the three enhancement periods CMP, NP, and EP and trained on the test set. Like the radiomics methods, the deep learning models also need the input CT image data to be preprocessed before training. In addition to adjusting the window width and level, resampling, and standardizing the gray values of the image as described in the previous section, it is also necessary to design preprocessing steps such as image data augmentation and region-of-interest cropping to increase the diversity of the training data and adapt to the input requirements of the model. In the experiment, the CT image data of the training set were augmented and the minority class was balanced. Specifically, after each tumor area image was taken as a block, it

Finally, in dataset1 and dataset2, the average accuracy, sensitivity, specificity, AUC, and F1 score of the three classifiers are shown in Table 3. We again find that NP has better robustness in discriminating nuclear grades. Among the models, ResNet18 shows the best performance (Dataset1, VGG11: AUC = 0.59 ± 0.04, ResNet18: AUC = 0.81 ± 0.02, GoogLeNet: AUC = 0.78 ± 0.04; Dataset2, VGG11: AUC = 0.68 ± 0.11, ResNet18: AUC = 0.87 ± 0.07, GoogLeNet: AUC = 0.57 ± 0.06). In addition, as shown in Fig. 8, we draw heat maps of the regions of interest in images of the different periods when ResNet18 classifies high-grade and low-grade ccRCC.
From a medical point of view, this enhancement pattern across periods is determined by the pathological characteristics of ccRCC. In the CMP stage, the ccRCC tumor begins to receive blood supply from the renal parenchyma. Because of the high metabolic activity of ccRCC tumor cells, they absorb more contrast medium, which leads to gradual and significant enhancement of the tumor. In the NP stage, the contrast agent begins to penetrate into the renal pelvis and ureter, and ccRCC reaches a stable peak enhancement; that is, the absorption of contrast agent in the tumor tissue reaches its highest point, which leads to the most obvious contrast enhancement of the tumor area on CT images. This peak enhancement makes the characteristics of the tumor more obvious on NP CT images. In the CMP stage, some CT images are close to the peak enhancement of NP, which may explain why some CMP models perform similarly to NP models. In the EP stage, the contrast agent begins to be excreted from the kidney and the enhancement of ccRCC gradually weakens, so the tumor image characteristics are not significant, which leads to the poor performance of the model.

Discussion
Our research showed that NP was the optimal stage to diagnose WHO/ISUP nuclear grades, and the model performance was more robust.By comparing features extracted from CT images of different enhancement stages, and building machine learning models and deep learning models, we found that image features differed by enhancement stage at the same nuclear-grade level.Moreover, the enhancement intensity has the greatest difference between high-grade and low-grade in NP.
There is no consistent result from previous studies on which enhancement phase should be used to differentiate the ccRCC WHO/ISUP nuclear grades. Previous studies have assessed the textural features of chromophobe RCC and ccRCC on monophasic CT and predicted the potential discriminative role of Fuhrman nuclear grade 15,29,30. Another study investigated the association between Fuhrman nuclear grade and CT images in three-phase ccRCC involving the plain scan period, CMP, and NP 31. The result also confirmed that CT radiomic features can be considered a useful and promising noninvasive methodology for preoperative evaluation of Fuhrman grades in ccRCC. However, in recent years, the Fuhrman nuclear grading system has been controversial because of its poor interpretability and uncertain prognostic value 32-34, and it was gradually replaced by the WHO/ISUP nuclear grading system, which was proposed by the International Society of Urological Pathology and can accurately distinguish the nuclear grade of renal cancer patients 7. The WHO/ISUP nuclear grading system was used in our study as the gold standard for differentiating nuclear grades, which gives it greater research value than previous studies. In addition, some studies have analyzed texture features and used them to differentiate WHO/ISUP nuclear grades. Nevertheless, they only included single-phase 19, two-phase 20, or three-phase 21,35,36 CT, and no studies included images of the excretory period. To our knowledge, only one study included the EP and compared four-phase CT in predicting Fuhrman nuclear grade 37. The results showed that image features extracted from the unenhanced phase CT images demonstrated a dominant classification performance, which was inconsistent with previous studies. Here, we performed a comprehensive comparative analysis of the three enhancement phases (including EP) and found that the model based on the nephrographic period showed robust and excellent performance.

In a study on which phase should be used to differentiate ccRCC from other renal masses, the texture features of renal masses in three phases were evaluated, including non-contrast enhancement [NECT], cortico-medulla [CM], and nephrogram [NG]. The results showed that ccRCC varies across enhancement stages and that the enhancement stage is an essential variable in the texture analysis of renal masses 38. In addition, another texture analysis study on predicting Fuhrman nuclear grades of ccRCC found that, after Laplacian of Gaussian filtering, there is a statistically significant difference between the entropy (fine) of CMP and the entropy (fine and coarse) of NP 16. Studies have shown that NP is considered the most sensitive period for detecting and describing lesions; it is superior to CMP images in lesion detection and superior or equal to CMP images in lesion characterization, because the enhancement of the tumor is lower than that of the surrounding renal parenchyma during NP 39. In our study, CMP showed performance equivalent to NP when using the SVM classifier, but NP was better overall than CMP in terms of model stability. On the other hand, the texture features of the EP were poor at distinguishing high from low nuclear grade tumors, probably because the differences in enhancement rates partially disappeared as the uptake of the contrast agent continued in the slowly enhancing tumors.
We also found that the deep learning models performed comparably to, and in some cases worse than, the machine learning models. This may be due, on the one hand, to the limited amount of data and, on the other hand, to the insensitivity of the deep learning models to subtle image changes between the enhancement phases. The main limitation of our study is that the dataset is relatively small because of the strict inclusion criteria, which required every case to contain all three phases. Splitting an independent test set from such a small dataset would further exacerbate the data scarcity problem. Moreover, our intention was not to build a model with the highest possible classification accuracy, but to conduct a comparative evaluation and analysis of the three phases. To reduce the bias caused by the small dataset, we conducted extensive internal validation using multiple random seeds for oversampling and obtained consistent and stable results, which further increased the reliability of our findings. In addition, this study is retrospective, and prospective studies are warranted to further validate the performance of these models in distinguishing high and low nuclear grades at different phases. Dataset2 is an external independent validation set. We validated on Dataset2 the optimal machine learning and deep learning models that we trained on Dataset1. Both the intermediate data (such as the extracted and selected features, see the Supplementary Table) and the final results are consistent with our results on Dataset1, with only small deviations, which we consider to be within the acceptable range for experiments on different datasets.
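The repeated-seed validation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the per-patient scores are hypothetical random numbers standing in for classifier outputs, the number of seeds (10) is an assumption, and the names `auc` and `summarize` are introduced only for illustration. The sketch shows how a rank-based AUC can be computed for each seed and then reported as mean ± standard deviation, the form used in the results.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. Labels are 1 (high grade) or 0 (low grade).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Return the mean and sample standard deviation of per-seed AUCs."""
    return statistics.mean(aucs), statistics.stdev(aucs)


# Hypothetical stand-in for "train and score once per oversampling seed":
# 41 low-grade (0) and 10 high-grade (1) cases, with simulated scores.
labels = [0] * 41 + [1] * 10
per_seed_auc = []
for seed in range(10):  # assumed number of seeds
    rng = random.Random(seed)
    scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]
    per_seed_auc.append(auc(labels, scores))

mean_auc, sd_auc = summarize(per_seed_auc)
print(f"AUC = {mean_auc:.2f} ± {sd_auc:.2f}")
```

A small standard deviation across seeds is what supports a claim of stability; with real models, the simulated scores would be replaced by the predicted probabilities obtained for each seed.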
At present, machine learning classification algorithms are mainly applied to CT images, and the medical image data used in this study are limited to CT images. However, if the patient's pathological images and other modalities such as gene sequencing data can be included at the same time, more comprehensive disease characteristics and biomarkers can be obtained. A multi-modal fusion model could then be designed for multi-stage diagnosis of the pathological nuclear grade of ccRCC, which would help to reveal the correlation between different modalities and the potential biological mechanism. Such a model could better exploit the complementarity and correlation between these data and provide more comprehensive and credible auxiliary information for clinical diagnosis, such as the presence of necrosis or a central scar, renal sinus and upper urinary tract involvement, and extrarenal extension.

Conclusion
In conclusion, we compared the strengths and weaknesses of the different phases of multiphase contrast-enhanced CT images of ccRCC in distinguishing high from low nuclear grade, and the results showed that the nephrographic phase (NP) was the best phase. The Firstorder_Kurtosis and firstorder_90Percentile features played a vital role in the classification task.
This conclusion provides a useful reference for clinicians in preoperative evaluation and treatment planning, and is of significance for improving the early screening rate of renal cell carcinoma and the prognosis of patients.

Figure 1. The technical route flow chart of the whole research.

Figure 3. An example of the manual segmentation on multiphase CT of high-grade and low-grade ccRCC. (a-c) show the tumor regions in CMP, NP, and EP. Each panel also shows the ROI drawn on the CT image.

Figure 4. ROC curves of the SVM, RF, and XGB classifiers based on the CT images of CMP, EP, and NP in Dataset1 and Dataset2: (A-C) show the ROC curves based on CMP, EP, and NP in Dataset1; (D-F) show the ROC curves based on CMP, EP, and NP in Dataset2.

Figure 5. The heat map shows the distribution of normalised texture feature values between low and high nuclear grade ccRCC based on CMP, NP, and EP. Differences in color and shade indicate dissimilarity of the corresponding texture parameter values.

Figure 6. Distribution map of the p-values for feature differences between pairs of phases: (A) high-grade group; (B) low-grade group.

Figure 7. Visualization of SHAP values showing feature contributions to the decisions of the optimal XGBoost model trained in Dataset2.

Figure 8. Visualization of the attention areas used by ResNet18 for classification. The first row shows a high-grade ccRCC, and the second row shows a low-grade ccRCC. (a-c) show the tumor regions in CMP, NP, and EP.
From the point of view of model training, if a category contains few samples, it provides too little information and the model will be biased toward the categories with many samples, which is undesirable. The high-grade and low-grade data distributions were not balanced in our study: there were 41 low-grade cases and 10 high-grade cases, so the model would learn the majority class (low-grade) better than the minority class (high-grade) during training. We used the synthetic minority oversampling technique (SMOTE) to address this problem; it changes the data distribution of an imbalanced dataset by adding generated minority class samples.
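The oversampling idea can be sketched as follows. This is a simplified illustration of the core SMOTE step (interpolating between a minority sample and one of its nearest minority neighbours), not the implementation used in the study. The function name `smote`, the choice of `k=5`, and the two-feature toy data are all assumptions made for the example; only the 41:10 class ratio comes from the text.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority feature vectors.

    Each synthetic sample lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of base (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy features mirroring the 41 low-grade / 10 high-grade split.
data_rng = random.Random(42)
low_grade = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(41)]
high_grade = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]

# Generate enough synthetic high-grade samples to balance the two classes.
new_high = smote(high_grade, n_new=len(low_grade) - len(high_grade), k=5, seed=1)
balanced_high = high_grade + new_high
print(len(low_grade), len(balanced_high))
```

Oversampling of this kind is normally applied to the training folds only, so that synthetic samples never appear in the data used for evaluation.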

Table 1. Clinical characteristics of patients in low-grade and high-grade groups.

Table 2. Performance of the machine learning models based on the CT image features of CMP, EP, and NP in Dataset1 and Dataset2. The best results in the table are marked in bold, and the second-best results are underlined.

Table 3. Performance of the deep learning models based on the CT image features of CMP, EP, and NP in Dataset1 and Dataset2. The best results in the table are marked in bold, and the second-best results are underlined.